Spatiotemporal Patterns of Influenza Activity in the United States

BMIN503/EPID600 Final Project

Author

Grace Li

Published

December 10, 2023


Overview

Seasonal influenza viruses continue to be a major global burden, with around one billion cases and up to 650,000 deaths every year. Influenza activity patterns vary geographically and temporally, so we want to examine the spatiotemporal patterns of influenza activity in the United States and determine whether variations in influenza activity across states or regions are associated with the distribution of circulating virus strains. I will be using surveillance data published by the Centers for Disease Control and Prevention (CDC) and viral sequences from the Global Initiative on Sharing All Influenza Data (GISAID) to answer these questions. Understanding patterns of seasonal influenza activity can help guide early influenza surveillance and inform influenza prevention and control programs. Materials for this project can be found in this GitHub repository.

Introduction

Influenza has a large seasonal burden in the United States: up to 35 million people are infected annually, causing between 12,000 and 56,000 deaths. In the U.S., flu season typically starts in the 40th week of the year (the first week of October), but influenza activity, including time of onset, duration, and number of peaks, varies greatly across states. The CDC routinely publishes weekly reports of influenza surveillance data, providing a national picture of influenza virus activity in the U.S. One component of this surveillance is performed by clinical laboratories across the country, which determine the percentage of specimens positive for influenza virus. These test results allow us to evaluate whether influenza cases are increasing or decreasing.

There are four types of influenza viruses, but only influenza A and B viruses are responsible for seasonal epidemics in the human population. Influenza A viruses are divided into subtypes based on their two surface glycoproteins, hemagglutinin and neuraminidase; H1N1 and H3N2 are the main influenza A subtypes that infect people. Similarly, influenza B viruses are divided into lineages. Influenza viruses can be further subdivided into clades and subclades based on their hemagglutinin gene sequences for virus characterization. Efforts by public health laboratories to sequence influenza genomes have allowed researchers to genetically characterize influenza viruses using public databases. Influenza viruses are constantly changing, leading to the introduction of new clades and the extinction of less fit strains. With sequencing data, we can determine influenza clade distribution by examining the number of clades present and the proportion of each clade at a given time point. It is unclear whether clade distribution patterns are associated with measures of influenza activity such as flu season duration. By linking regional epidemiologic data to sequences of circulating influenza viruses, we hope to identify patterns of clade distribution associated with influenza activity in the U.S. An understanding of the spatial and temporal spread of influenza can help inform public health decisions in influenza prevention and control.

Methods

Data source

Datasets                    Time range                 Source
Influenza surveillance      2010 to present            FluView
Influenza sequencing data   2010-10-01 to 2019-10-01   GISAID

Influenza surveillance

We obtained influenza surveillance data from the CDC Weekly U.S. Influenza Surveillance Report. Viral surveillance from approximately 300 clinical laboratories in the U.S. World Health Organization (WHO) Collaborating Laboratories System and the National Respiratory and Enteric Virus Surveillance System (NREVSS) is reported weekly, and the data can be accessed through FluView Interactive. Data collection from both systems began during the 1997-98 season. The total number of respiratory specimens tested and the percent positive for influenza are reported each week.

For this project, I downloaded data from all 50 states starting from the 2010-11 season. Data from clinical laboratories include the weekly total number of specimens tested, the number of positive influenza tests, and the percent positive by influenza type.
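
The percent positive reported by clinical laboratories is, presumably, simply the number of positive tests divided by the total number of specimens tested. With illustrative (made-up) numbers:

```r
# Hypothetical weekly counts for one state
total_specimens <- 200
total_A <- 30   # positive influenza A tests
total_B <- 10   # positive influenza B tests

(total_A + total_B) / total_specimens * 100  # percent positive
#> [1] 20
```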

GISAID

GISAID, the Global Initiative on Sharing All Influenza Data, is a global database that provides access to genomic data of influenza viruses and other pathogens, including coronaviruses and poxviruses. Sequence data and associated metadata are available to registered users who agree to GISAID's data access terms.

Within the EpiFlu database in GISAID, I searched for human influenza A and influenza B virus isolates collected in the United States from October 2010 to October 2019. I further filtered the selection by requiring each isolate to contain complete sequences of both the hemagglutinin and neuraminidase segments and to not have been passaged in cells or eggs. I downloaded the virus metadata, which includes the isolate name, subtype, and clade.

Download and clean data

First, we will load the downloaded weekly U.S. Influenza Surveillance Report data.

# Load WHO/NREVSS surveillance data from clinical labs
clinical_labs <- read_csv("WHO_NREVSS_Clinical_Labs.csv")

Next, I will clean the surveillance data by removing unused variables and renaming the remaining ones. Influenza season typically starts on week 40 (the first week of October), so I will create a variable named “ordered_week” that reorders weeks to start with week 40.

# Rename variables and order week variable
clinical_labs <- clinical_labs |>
        dplyr::select(-c("REGION TYPE")) |>
        dplyr::rename(state = REGION, 
               year = YEAR,
               week = WEEK,
               total_specimens = `TOTAL SPECIMENS`, 
               total_A = `TOTAL A`,
               total_B = `TOTAL B`,
               percent_positive = `PERCENT POSITIVE`,
               percent_A = `PERCENT A`,
               percent_B = `PERCENT B`) |> 
        mutate(ordered_week = (week - 40) %% 52 + 1) |>
        mutate(season = ifelse(week >= 40, year, year - 1)) |>
        mutate(ordered_week = as.factor(ordered_week)) |>
        mutate(percent_positive = as.numeric(percent_positive))
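
As a quick sanity check on the modular arithmetic above, here is an illustrative snippet (not part of the pipeline) confirming that week 40 maps to ordered week 1 and that the season label is the season's starting year:

```r
# Weeks 40, 52, 1, and 39 should map to ordered weeks 1, 13, 14, and 52
weeks <- c(40, 52, 1, 39)
(weeks - 40) %% 52 + 1
#> [1]  1 13 14 52

# Weeks >= 40 belong to the season starting that calendar year;
# earlier weeks belong to the previous year's season
ifelse(c(40, 39) >= 40, 2017, 2017 - 1)
#> [1] 2017 2016
```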

Since I will be drawing maps, I will retrieve shapefiles for all states using the tigris package and obtain state information from the R datasets package.

# Retrieve shapefile for all states using tigris::states()
states <- states()

# Obtain state information on names and division using the package datasets
states_division <- data.frame(NAME = datasets::state.name, 
                              division = datasets::state.division)

# Merge shapefile with state information 
states <- inner_join(states, states_division, by = "NAME") |>
            rename(state = NAME)

Now I will merge the shapefile data frame with surveillance data.

# Include state division info in clinical_labs df  
clinical_labs <- inner_join(states, clinical_labs, by = "state")

Finally, I am going to load the influenza sequencing data downloaded from GISAID.

# Load data for influenza A and influenza B viruses
gisaid_a <- read_csv("gisaid_epiflu_isolates_fluA.csv")
gisaid_b <- read_csv("gisaid_epiflu_isolates_fluB.csv")

# Combine data for influenza A and influenza B viruses
gisaid <- bind_rows(gisaid_a, gisaid_b)

# Select relevant variables and rename variables
gisaid <- gisaid |>
            dplyr::select(c("Isolate_Id", "Isolate_Name", 
                            "Subtype", "Clade", "Collection_Date")) |>
            rename(id = Isolate_Id, 
                   name = Isolate_Name,
                   subtype = Subtype,
                   clade = Clade,
                   collection_date = Collection_Date) 

# Convert 'collection_date' to Date format
gisaid$collection_date <- as.Date(gisaid$collection_date)

# Create a new column 'week' indicating the week of the year
gisaid$week <- week(gisaid$collection_date)

# Create a new column 'season' indicating the flu season from which the isolate was sequenced
gisaid$season <- ifelse(gisaid$week >= 40, year(gisaid$collection_date), 
                        year(gisaid$collection_date) - 1)

# Create a new column 'ordered_week' for graphing and convert 'ordered_week' to numeric type
gisaid <- gisaid |>
            mutate(ordered_week = (week - 40) %% 52 + 1) |>
            mutate(ordered_week = as.numeric(ordered_week))

# Create a new column 'state' to indicate where the virus isolate came from
gisaid <- gisaid |>
  mutate(state = sapply(strsplit(as.character(name), "/"), `[`, 2))
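
This step relies on the GISAID isolate naming convention, type/location/isolate-number/year, where the second "/"-delimited field is the collection location. A minimal illustration with a hypothetical isolate name:

```r
# Extract the location (second field) from a "/"-delimited isolate name
name <- "A/Pennsylvania/08/2010"  # hypothetical isolate name
strsplit(name, "/")[[1]][2]
#> [1] "Pennsylvania"
```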

head(gisaid)
# A tibble: 6 × 9
  id         name  subtype clade collection_date  week season ordered_week state
  <chr>      <chr> <chr>   <chr> <date>          <dbl>  <dbl>        <dbl> <chr>
1 EPI_ISL_9… A/Pe… A / H3… 3C.2  2010-12-07         49   2010           10 Penn…
2 EPI_ISL_9… A/Pe… A / H3… 3C.2  2010-11-23         47   2010            8 Penn…
3 EPI_ISL_9… A/Pe… A / H3… 3C.2  2010-12-07         49   2010           10 Penn…
4 EPI_ISL_9… A/Fl… A / H3… 3C.2  2010-12-09         49   2010           10 Flor…
5 EPI_ISL_9… A/No… A / H3… 3C.2  2011-02-16          7   2010           20 Nort…
6 EPI_ISL_9… A/Te… A / H3… 3C.2  2011-02-09          6   2010           19 Texas

Next we will clean up the GISAID dataset.

# Remove rows where 'subtype' is "A", "A / H1N2", or "A / H3" (incomplete sequencing)
gisaid <- gisaid |> 
            filter(!(subtype %in% c("A", "A / H1N2", "A / H3")))

# Remove rows where 'clade' or 'collection_date' is NA or if clade is unassigned
gisaid <- gisaid |>
            filter(!is.na(clade) & !is.na(collection_date)) |>
            filter(!(clade %in% c("unassigned")))

# Define a vector containing the names of all 50 states
states_list <- c(
  "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
  "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho",
  "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine",
  "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi",
  "Missouri", "Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey",
  "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio", "Oklahoma",
  "Oregon", "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota",
  "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington", "West Virginia",
  "Wisconsin", "Wyoming"
)

# Keep only rows where the "state" column matches one of the 50 states (exclude entries like New York City or District of Columbia)
gisaid <- gisaid |>
            filter(state %in% states_list)

head(gisaid)
# A tibble: 6 × 9
  id         name  subtype clade collection_date  week season ordered_week state
  <chr>      <chr> <chr>   <chr> <date>          <dbl>  <dbl>        <dbl> <chr>
1 EPI_ISL_9… A/Pe… A / H3… 3C.2  2010-12-07         49   2010           10 Penn…
2 EPI_ISL_9… A/Pe… A / H3… 3C.2  2010-11-23         47   2010            8 Penn…
3 EPI_ISL_9… A/Pe… A / H3… 3C.2  2010-12-07         49   2010           10 Penn…
4 EPI_ISL_9… A/Fl… A / H3… 3C.2  2010-12-09         49   2010           10 Flor…
5 EPI_ISL_9… A/No… A / H3… 3C.2  2011-02-16          7   2010           20 Nort…
6 EPI_ISL_9… A/Te… A / H3… 3C.2  2011-02-09          6   2010           19 Texas

Results

Exploratory data analysis

CDC surveillance data

I will start with some exploratory data analysis by looking at the CDC surveillance data in the 2017-18 flu season.

# Filter data from the 2017-18 season 
season2017 <- clinical_labs |>
                filter(season == "2017")

# Set up theme for graphing maps 
my_theme <- function() {
  theme_minimal() +                                  
  theme(axis.line = element_blank(),                 
        axis.text = element_blank(),                 
        axis.title = element_blank(),
        panel.grid = element_line(color = "white"),  
        legend.key.size = unit(0.4, "cm"),          
        legend.text = element_text(size = 8),       
        legend.title = element_text(size = 10),
        plot.title = element_text(size = 12))      
}

# Create a color palette function
myPalette <- colorRampPalette(brewer.pal(9, "YlOrRd"))

# Function to create a single map for a given week
create_weekly_map <- function(week_data) {
  week_data |> 
    st_simplify() |>
    ggplot() +
    geom_sf(aes(fill = percent_positive)) + 
    my_theme() +
    labs(title = paste("Week", unique(week_data$week))) +
    scale_fill_gradientn(name = "% Positive", colours = myPalette(100),
                         limits = c(0, 70))
}

# Create a list to store individual plots
season2017_map_list <- list()

# Create plots for each week
unique_week <- unique(season2017$week)

for (ii in unique_week) {
  # Filter data for the specific week and shift geometry
  week_data <- season2017 |> 
    filter(week == ii) |>
    shift_geometry()
  
  # Create a weekly map and add it to the list
  weekly_map <- create_weekly_map(week_data)
  
  season2017_map_list[[length(season2017_map_list) + 1]] <- weekly_map
}
# Display each weekly map in sequence (frames for an animation)
for (i in seq_along(season2017_map_list)) {
  print(season2017_map_list[[i]])
}

Overall, we can see that the timing and duration of the flu season varied across the country in 2017-18. In some states, the flu season began as early as October or November, while other states had a later start but continued to have flu circulation into May. We can also see some of the potential issues with the CDC FluView data: surveillance data are incomplete for some states and entirely absent for others, and this problem is consistent across flu seasons. This reflects both how I selected the surveillance data and the structure of the overall surveillance system in the United States.
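
A quick way to quantify this missingness is to count the weeks with missing percent-positive values per state. The sketch below uses a toy data frame mirroring the relevant clinical_labs columns rather than the real dataset:

```r
# Toy stand-in for the state and percent_positive columns
toy <- data.frame(
  state = rep(c("Florida", "Oregon"), each = 4),
  percent_positive = c(5, NA, NA, 20, 3, 8, 15, 22)
)

# Number of weeks with missing surveillance data, by state
tapply(is.na(toy$percent_positive), toy$state, sum)
#> Florida  Oregon 
#>       2       0
```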

Next, let’s take a closer look at the CDC surveillance data by state.

# Percent positive cases in 2017-18 season 
season2017 |> 
  ggplot(aes(x = ordered_week, y = percent_positive, fill = division)) +
        facet_wrap(~ state, nrow = 5, ncol = 10) +
        geom_bar(stat = "identity") +
        scale_fill_brewer(palette = "GnBu") +
        scale_x_discrete(breaks = c(1, 14, 27, 40, 52), 
                         labels = c(40, 1, 14, 27, 39)) +
        scale_y_continuous() +
        labs(x = "Week", y = "Percent Positive", 
             title = "2017-18 Influenza Positive Tests Reported to CDC") +
        theme_classic() +
        theme(legend.position = "none",
              strip.background = element_blank(),
              panel.border = element_rect(color = "black", fill = NA))

It is clear that influenza activity varied greatly across states in the 2017-18 season. States like Nevada detected influenza early, whereas some states did not experience peak flu activity until later. Flu activity remained elevated into the spring in some regions, such as the East North Central states. In addition, some states had longer flu seasons than others, and some experienced several waves of influenza activity.

The U.S. Census Bureau has grouped the 50 states and the District of Columbia into nine divisions based on geographic proximity. I will also look at the surveillance data within each division to see if states in close proximity might exhibit similar flu activity patterns.

# Get unique divisions
unique_divisions <- unique(season2017$division)

# Create a list to store individual plots
season2017_plot_list <- list()

# Create plots for each division
for (i in unique_divisions) {
        
  division_data <- season2017 |> 
          filter(division == i)
  
  p <- ggplot(division_data, aes(x = ordered_week, y = percent_positive, fill = state)) +
    geom_bar(stat = "identity") +
    facet_wrap(~ state, ncol = 4, nrow = 2, scales = "free") +
    scale_fill_manual(values = wes_palette("Zissou1", n = 8, type = "continuous")) +
    scale_x_discrete(breaks = c(1, 14, 27, 40, 52), labels = c(40, 1, 14, 27, 39)) +
    scale_y_continuous(limits = c(0, 66)) +
    labs(x = "Week", y = "Percent Positive", title = i) +
    theme_classic() +
    theme(legend.position = "none",
          strip.background = element_blank(),
          panel.border = element_rect(color = "black", fill = NA),
          aspect.ratio = 1)
 season2017_plot_list[[length(season2017_plot_list) + 1]] <- p
}

print(season2017_plot_list)
[Nine plots rendered here, one per census division.]

In some divisions, flu activity was very similar between states. For example, New York and Pennsylvania had similar patterns of flu activity, both peaking around week 5. East North Central states also had relatively homogeneous flu activity.

In contrast, states in the Mountain division had quite distinct patterns. Nevada was one of the first states to start the 2017-18 flu season, but flu activity did not pick up until later in the year in Utah. Interestingly, Idaho experienced two waves of influenza activity.

# Percent positive cases in 2017-18 season in Mountain
season2017 |>
        filter(division == "Mountain") |>
        ggplot(aes(x = ordered_week, y = percent_positive, fill = state)) +
        geom_bar(stat = "identity") +
        facet_wrap(~ state) +
        scale_fill_manual(values = c('#855C75', '#D9AF6B', '#AF6458', '#736F4C', '#526A83', '#625377', '#68855C', 'deepskyblue4')) +
        scale_x_discrete(breaks = c(1, 14, 27, 40, 52), labels = c(40, 1, 14, 27, 39)) +
        scale_y_continuous() +
        labs(x = "Week", y = "Percent Positive", title = "2017-18 Mountain Influenza Activity") +
        theme_classic() +
        theme(legend.position = "none",
              strip.background = element_blank(),
              panel.border = element_rect(color = "black", fill = NA))

# Function to plot percent positive tests for a given state in the 2017-18 season
state_percent_plot <- function(state_of_interest) {
  season2017 |>
        filter(state == state_of_interest) |>
        ggplot(aes(x = ordered_week, y = percent_positive, fill = division)) +
        geom_bar(stat = "identity") +
        scale_fill_manual(values = "#AF6458") +
        scale_x_discrete(breaks = 1:52, labels = c(40:52, 1:39)) +
        scale_y_continuous() +
        labs(x = "Week", y = "Percent Positive", 
             title = paste0("2017-18 ", state_of_interest, 
                            " Influenza Positive Tests")) +
        theme_classic() +
        theme(legend.position = "none")
}

# Percent positive cases in 2017-18 season in Idaho
state_percent_plot("Idaho")

GISAID sequencing data

Now let’s take a look at the influenza sequencing data retrieved from GISAID.

# Group by 'subtype' and count the occurrences of each subtype
gisaid |>
  group_by(subtype) |>
  count()
# A tibble: 3 × 2
# Groups:   subtype [3]
  subtype      n
  <chr>    <int>
1 A / H1N1  5850
2 A / H3N2 10369
3 B         2206
# Create a table summary by 'state'
gisaid |> 
  select(-c(id, name, collection_date, week, clade, ordered_week, season)) |> 
  tbl_summary(by = state)
Subtype distribution by state, n (%) (tbl_summary output, reformatted):

State             N      A / H1N1     A / H3N2      B
Alabama           138    48 (35%)     79 (57%)      11 (8.0%)
Alaska            498    84 (17%)     388 (78%)     26 (5.2%)
Arizona           371    127 (34%)    162 (44%)     82 (22%)
Arkansas          124    42 (34%)     70 (56%)      12 (9.7%)
California        782    315 (40%)    374 (48%)     93 (12%)
Colorado          291    109 (37%)    146 (50%)     36 (12%)
Connecticut       233    70 (30%)     95 (41%)      68 (29%)
Delaware          278    94 (34%)     125 (45%)     59 (21%)
Florida           761    222 (29%)    363 (48%)     176 (23%)
Georgia           204    75 (37%)     113 (55%)     16 (7.8%)
Hawaii            487    148 (30%)    233 (48%)     106 (22%)
Idaho             190    59 (31%)     102 (54%)     29 (15%)
Illinois          279    94 (34%)     131 (47%)     54 (19%)
Indiana           201    73 (36%)     98 (49%)      30 (15%)
Iowa              326    89 (27%)     187 (57%)     50 (15%)
Kansas            107    46 (43%)     52 (49%)      9 (8.4%)
Kentucky          166    60 (36%)     71 (43%)      35 (21%)
Louisiana         267    58 (22%)     115 (43%)     94 (35%)
Maine             161    52 (32%)     78 (48%)      31 (19%)
Maryland          311    102 (33%)    139 (45%)     70 (23%)
Massachusetts     166    66 (40%)     75 (45%)      25 (15%)
Michigan          1,275  342 (27%)    847 (66%)     86 (6.7%)
Minnesota         256    92 (36%)     113 (44%)     51 (20%)
Mississippi       179    47 (26%)     113 (63%)     19 (11%)
Missouri          170    54 (32%)     99 (58%)      17 (10%)
Montana           289    103 (36%)    145 (50%)     41 (14%)
Nebraska          111    38 (34%)     61 (55%)      12 (11%)
Nevada            159    70 (44%)     75 (47%)      14 (8.8%)
New Hampshire     148    43 (29%)     79 (53%)      26 (18%)
New Jersey        227    72 (32%)     122 (54%)     33 (15%)
New Mexico        197    94 (48%)     87 (44%)      16 (8.1%)
New York          556    163 (29%)    336 (60%)     57 (10%)
North Carolina    240    75 (31%)     129 (54%)     36 (15%)
North Dakota      186    74 (40%)     87 (47%)      25 (13%)
Ohio              319    114 (36%)    184 (58%)     21 (6.6%)
Oklahoma          163    61 (37%)     91 (56%)      11 (6.7%)
Oregon            145    51 (35%)     77 (53%)      17 (12%)
Pennsylvania      1,301  467 (36%)    760 (58%)     74 (5.7%)
Rhode Island      89     43 (48%)     37 (42%)      9 (10%)
South Carolina    140    35 (25%)     80 (57%)      25 (18%)
South Dakota      254    74 (29%)     127 (50%)     53 (21%)
Tennessee         268    83 (31%)     167 (62%)     18 (6.7%)
Texas             1,350  359 (27%)    884 (65%)     107 (7.9%)
Utah              291    102 (35%)    127 (44%)     62 (21%)
Vermont           123    36 (29%)     66 (54%)      21 (17%)
Virginia          285    116 (41%)    111 (39%)     58 (20%)
Washington        1,503  465 (31%)    982 (65%)     56 (3.7%)
West Virginia     140    41 (29%)     76 (54%)      23 (16%)
Wisconsin         1,595  457 (29%)    1,034 (65%)   104 (6.5%)
Wyoming           125    46 (37%)     77 (62%)      2 (1.6%)

Values are n (%).

# Create a table summary by 'season'
gisaid |> 
  select(-c(id, name, collection_date, week, clade, ordered_week, state)) |> 
  tbl_summary(by = season)
Subtype distribution by season, n (%) (tbl_summary output, reformatted):

Season   N       A / H1N1      A / H3N2      B
2010     13      2 (15%)       11 (85%)      0 (0%)
2012     5       3 (60%)       2 (40%)       0 (0%)
2013     11      1 (9.1%)      10 (91%)      0 (0%)
2014     613     9 (1.5%)      594 (97%)     10 (1.6%)
2015     2,535   1,135 (45%)   866 (34%)     534 (21%)
2016     4,088   403 (9.9%)    2,876 (70%)   809 (20%)
2017     4,463   1,268 (28%)   2,854 (64%)   341 (7.6%)
2018     6,684   3,027 (45%)   3,148 (47%)   509 (7.6%)
2019     13      2 (15%)       8 (62%)       3 (23%)

Values are n (%).

From the summaries, we can see that there are more influenza A sequences than influenza B sequences from the United States between 2010 and 2019, which is consistent with circulation patterns, since influenza A is more common overall. Similar to the issue we encountered with the surveillance data, the number of isolate sequences varies from state to state depending on the number of laboratories and the resources allocated to influenza genome sequencing. My selection criteria might also have introduced sampling bias. Finally, breaking the data down by season shows that most of the sequencing data come from 2015 to 2018, so I will focus my analysis on those flu seasons.

Duration of flu season and clade distribution

I want to start by determining which states had the longest and shortest flu seasons for each season between 2015 and 2019. To calculate this, I count the number of weeks in which more than 10% of specimens tested were positive for influenza virus in each state.

#| label: flu_season_duration

# Define the season of interest
seasons <- 2015:2018

# Extract relevant columns from clinical_labs and convert from sf to dataframe
clinical_labs_df <- clinical_labs |>
                      select(week, state, year, season,
                             percent_positive, ordered_week) |>
                      st_drop_geometry() # drops geometry and reclasses df
  

# Initialize an empty data frame to store flu season duration results
flu_duration <- data.frame()  

# Iterate over each season to find states with maximum and minimum flu season durations
for (ii in seasons) {
  
  # Filter and summarize data for each state in the current season
  season_data <- clinical_labs_df |>
                  filter(season == ii) |>
                  group_by(state) |>
                  summarize(flu_season_duration = sum(percent_positive > 10, 
                                                      na.rm = TRUE))
  
  # Find the state with the maximum flu season duration
  max_duration_state <- season_data |>
                          filter(flu_season_duration == 
                                 max(flu_season_duration, na.rm = TRUE)) |>
                          select(state)
  
  # Find the state with the minimum flu season duration
  min_duration_state <- season_data |>
                          filter(flu_season_duration == 
                                   min(flu_season_duration, na.rm = TRUE)) |>
                          select(state)

  # Append results to the flu_duration data frame
  flu_duration <- bind_rows(flu_duration, 
                            data.frame(Season = ii, 
                                       Max_State = max_duration_state$state, 
                                       Min_State = min_duration_state$state))
}

flu_duration
   Season Max_State     Min_State
1    2015    Oregon       Florida
2    2015    Oregon      Missouri
3    2015    Oregon    New Jersey
4    2015    Oregon  Rhode Island
5    2016     Maine        Alaska
6    2016     Maine       Florida
7    2016     Maine    New Jersey
8    2016     Maine  Rhode Island
9    2016     Maine       Wyoming
10   2017    Oregon       Florida
11   2017    Oregon New Hampshire
12   2017    Oregon    New Jersey
13   2017    Oregon  Rhode Island
14   2018   Vermont       Florida
15   2018   Vermont New Hampshire
16   2018   Vermont    New Jersey
17   2018   Vermont  Rhode Island
18   2018   Vermont       Wyoming

However, due to unequal sample sizes between states in each season, I want to make sure that missing values are not biasing the calculation above. Therefore, I also calculate the ratio of weeks with a greater than 10% influenza positivity rate to the total number of weeks with non-missing values, which accounts for states having different numbers of missing values. A state must also have data points for more than 42 of the 52 weeks in a season to be included in the calculation.
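
On a toy weekly series, the calculation looks like this (illustrative values only; the real computation runs over clinical_labs_df):

```r
# Toy weekly percent-positive series with two missing weeks
pp <- c(2, 12, 25, NA, 18, 6, NA, 3)

duration <- sum(pp > 10, na.rm = TRUE)  # weeks above the 10% threshold: 3
reporting <- sum(!is.na(pp))            # weeks with non-missing data: 6
duration / reporting
#> [1] 0.5
```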

#| label: flu_season_duration_ratio

# Initialize an empty data frame to store flu season duration ratio results
flu_duration_ratio <- data.frame()

# Iterate over each season to find states with maximum and minimum flu season duration ratios
for (ii in seasons) {
  
  # Filter and summarize data for each state in the current season
  season_data <- clinical_labs_df |>
                  filter(season == ii) |>
                  group_by(state) |>
                  summarize(
                    total_weeks = sum(!is.na(percent_positive)),
                    flu_season_duration = sum(percent_positive > 10, na.rm = TRUE)
                    ) |>
                  mutate(ratio = flu_season_duration / total_weeks)
  
  # Find the state with the maximum flu season duration ratio
  max_ratio_state <- season_data |>
                      filter(total_weeks > 42) |>
                      filter(ratio == max(ratio, na.rm = TRUE)) |>
                      select(state)
  
  # Find the state with the minimum flu season duration ratio
  min_ratio_state <- season_data |>
                      filter(total_weeks > 42) |>
                      filter(ratio == min(ratio, na.rm = TRUE)) |>
                      select(state)
  
  # Append results to the flu_duration_ratio data frame
  flu_duration_ratio <- bind_rows(flu_duration_ratio, 
                                  data.frame(Season = ii, 
                                             Max_State = max_ratio_state$state, 
                                             Min_State = min_ratio_state$state))
}

flu_duration_ratio
  Season Max_State    Min_State
1   2015    Oregon     Missouri
2   2016   Vermont   California
3   2016   Vermont     Kentucky
4   2017    Oregon South Dakota
5   2018   Vermont South Dakota

We can see that the two tables produced slightly different lists of states with the longest or shortest duration, especially when determining the states with the shortest duration. I will use the ratio-based table for the rest of my analysis, since it accounts for the varying amounts of missing data.

Now I will look at the clade distribution of all sequences collected between 2015 and 2019. First, I will write a function that graphs a stacked line chart of clade proportions given the season, state, and subtype (H1N1, H3N2, or influenza B). Because I don’t have complete sequencing data for every week, I am graphing clade distribution by month.
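
The monthly binning inside the function relies on a simple date-flooring trick: reformat the date with the day fixed to "01" and parse it back. For example:

```r
# Floor a collection date to the first day of its month
d <- as.Date("2017-11-23")
as.Date(format(d, "%Y-%m-01"))
#> [1] "2017-11-01"
```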

# Create a function to make stacked line chart 
get_plot <- function(gisaid, plot_season, plot_state, plot_subtype){
  
  # Filter gisaid df
  data1 <- gisaid |>
            filter(season %in% plot_season) |>
            filter(state %in% plot_state) |>
            filter(subtype %in% plot_subtype)
  
  # Create plot titles 
  
  state_count <- length(plot_state)

  plot_title <- paste0("Influenza clade distribution in ", 
                       ifelse(state_count == 50, "all 50 states", 
                              paste(plot_state, collapse = ", ")),
                       ', Subtype: ', 
                       plot_subtype,' \nSeason: ', 
                       as.character(paste(plot_season, collapse = " ")))
  

  # Round collection dates to the first day of the month
  
  # Make a column with the date floored to the first of the month
  data1$date <- as.Date(format(data1$collection_date, "%Y-%m-01"))
  
  # Generate dates by week from min to max 
  d_min <- min(data1$date)
  d_max <- max(data1$date)
  
  # Generate a sequence of dates for each week
  date_week <- seq(from = d_min, to = d_max, by = "week")
  
  # Round each date to the first day of the month
  date_week <- as.Date(format(date_week, "%Y-%m-01"))
  
  # Generate lists of unique clades
  clades_unique <- unique(data1$clade)
  
  # Make a blank df for every date and clade combination 
  data2 <- expand.grid(date = date_week, clade = clades_unique)
  
  # Add a column for prop_clade (you can fill it with NAs initially)
  data2$prop_clade <- 0
  
  # Populate df based on data1 
  for(ii in 1:nrow(data2)){
    temp_w <- data2$date[ii] # Retrieve the week for this row
    temp_data <- subset(data1,data1$date == temp_w) # Select all the data from that week
    temp_table <- data.frame(table(temp_data$clade)) # Get a table of the frequency for the clades of that week
    temp_table$proportion <- temp_table$Freq / sum(temp_table$Freq) # Determine the proportion of that clade
    
    # Populate data2 based on the proportion of temp_table
    temp_c <- as.character(data2$clade[ii])  # Determine the clade of the current row
    temp_table2 <- subset(temp_table,temp_table$Var1 == temp_c) # Determine if that clade was found in the data that week
    if(nrow(temp_table2) > 0){ # If the clade was found for this week then save that clade's proportion to the data2 data frame
      data2$prop_clade[ii] <- temp_table2$proportion[1]
    }
    
    # Prints the completion percentage and the current iteration number to the console every 1000 iterations
    if (ii %% 1000 == 0) {
      # Calculate and print the percentage complete
      percent_complete <- (ii / nrow(data2)) * 100
      cat(sprintf("Progress: %.2f%% complete. Currently on iteration %d\n", percent_complete, ii))
    }
  }
  
  # Populate data3 based on data2
  data3 <- data2
  data3$date_clade <- paste0(as.character(data3$date),'_', 
                             as.character(data3$clade))
  temp_date_prev <- d_min 
  
  # Loop through the weeks and if there is a week with no data, then use the data from the last week.
  for(ii in 1:nrow(data3)){
    
    # Part 1: Detect which weeks have no data.
    temp_date <- data3$date[ii]
    temp_df <- subset(data3,data3$date == temp_date)
    temp_table <- data.frame(table(temp_df$prop_clade))
    # If there is data for this week then we will have more than one row in this data frame (after filtering out 0 for Var1)
    temp_table <- subset(temp_table,temp_table$Var1 != 0)
    
    # Part 2: Populate with last week's data.
    # If there is no data for this week (nrow == 0), then use the previous week's data.
    # Rationale: Determine which dates have no data, then look for the last week's data, make a lookup table which uses this week's data and then we populate data3 with that.
    if(nrow(temp_table) < 1){ # If there is no data
      # Since this week had no reported data, make a lookup table from the last week's data and populate this week's using it
      temp_last <- subset(data3,data3$date == temp_date_prev)
      # Modify the lookup table so that we can assign this week's date instead of last week's date
      temp_last$date <- temp_date
      temp_last$date_clade <- paste0(as.character(temp_last$date),
                                     '_',as.character(temp_last$clade))
      # Assign the data to be found in the table
      lookup_tab <- temp_last$prop_clade
      # Give the data names
      names(lookup_tab) <- temp_last$date_clade
      # Find indices where a match is found in the lookup table
      matching_indices <- match(data3$date_clade, names(lookup_tab))
      
      # Update the values in data3$prop_clade only where a match is found
      data3$prop_clade[!is.na(matching_indices)] <- lookup_tab[matching_indices[!is.na(matching_indices)]]
      
    }
    
    # Remember this as the previous date.
    temp_date_prev <- temp_date
    
    # Check if ii is a multiple of 1000
    if (ii %% 1000 == 0) {
      # Calculate and print the percentage complete
      percent_complete <- (ii / nrow(data3)) * 100
      cat(sprintf("Progress: %.2f%% complete. Currently on iteration %d\n", percent_complete, ii))
    }
  }
  
  breaks_num <- 50
  label_count <- date_week
  
  # Get the colors 
  # Create a gradient from red to violet
  color_gradient <- colorRampPalette(c("red", "orange", "yellow", 
                                       "blue", "violet"))
  # Generate the colors for each clade
  color_clade <- color_gradient(length(clades_unique))
  
  # Specify the breaks for the x-axis labels
  breaks <- seq(min(data3$date), max(data3$date), by = "1 months")
  
  # Format breaks to display as "Month Year"
  formatted_labels <- format(breaks, "%b %Y")
  
  # Plot with custom x-axis labels
  plot <- ggplot(data3, aes(fill = clade, y = prop_clade, x = date, group = clade)) +
    geom_area(linewidth = 0.5, color = "white") +
    labs(fill = "", x = "", y = "Proportion") +
    ggtitle(plot_title) +
    theme(axis.text = element_text(size = 9), 
          axis.text.x = element_text(angle = 45, hjust = 1),
          panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank(),
          panel.background = element_blank(), 
          axis.line = element_line(colour = "black")) +
    guides(fill = guide_legend(title = "Clade")) +
    scale_fill_manual(values = color_clade) +
    scale_x_date(breaks = breaks, labels = formatted_labels)
  
  # Return the plot (the function's last evaluated expression)
  plot
}
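The carry-forward step in `get_plot` relies on a named-vector lookup: proportions are keyed by a composite `date_clade` string and retrieved with `match()`. A minimal, self-contained sketch of that technique, using hypothetical toy values rather than the GISAID columns:

```r
# Toy illustration of the named-vector lookup used to carry last week's
# clade proportions forward (dates and clade names are hypothetical)
last_week <- c("2016-01-04_3C.2a" = 0.6, "2016-01-04_3C.3a" = 0.4)

# Re-key last week's values to this week's date so the keys line up
names(last_week) <- sub("2016-01-04", "2016-01-11", names(last_week))

this_week_keys <- c("2016-01-11_3C.2a", "2016-01-11_3C.3a")
idx <- match(this_week_keys, names(last_week))  # position of each key
filled <- unname(last_week[idx])                # carried-forward proportions
```

Keys with no match yield `NA` from `match()`, which is why the function only assigns where `!is.na(matching_indices)`.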
# Make stacked area chart for all H1N1 sequences in all 50 states
plot_season <- c(2015:2018)
plot_state <- states_list
plot_subtype <- "A / H1N1"
get_plot(gisaid, plot_season, plot_state, plot_subtype)
Progress: 40.65% complete. Currently on iteration 1000
Progress: 81.30% complete. Currently on iteration 2000

# Make stacked area chart for all H3N2 sequences in all 50 states
plot_season <- c(2015:2018)
plot_state <- states_list
plot_subtype <- "A / H3N2"
get_plot(gisaid, plot_season, plot_state, plot_subtype)
Progress: 34.84% complete. Currently on iteration 1000
Progress: 69.69% complete. Currently on iteration 2000

# Make stacked area chart for all influenza B (IBV) sequences in all 50 states
plot_season <- c(2015:2018)
plot_state <- states_list
plot_subtype <- "B"
get_plot(gisaid, plot_season, plot_state, plot_subtype)
Progress: 81.30% complete. Currently on iteration 1000

flu_season <- c(2015)
max_min_duration <- c("Oregon", "Missouri")
subtypes <- c("A / H1N1", "A / H3N2", "B")

# Create an empty list to store plots
plots_list <- list()

# Generate plots for all combinations of variables
for (season in flu_season) {
  for (state in max_min_duration) {
    for (subtype in subtypes) {
      tryCatch({
        # Generate a plot for the current combination
        plot <- get_plot(gisaid, season, state, subtype)
        # Store the plot in the list
        plots_list[[paste(season, state, subtype)]] <- plot
      }, error = function(e) {
        # Print a message when there is an error (e.g., no data for the combination)
        cat("Error:", e$message, "\n")
      })
    }
  }
}
plot_grid(plotlist = plots_list[1:3], ncol = 1)

plot_grid(plotlist = plots_list[4:6], ncol = 1)
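The season/state/subtype loop above is repeated verbatim for each season below; it could be factored into a small helper. A hedged sketch (the helper name `make_plots` is an assumption; it wraps the document's `get_plot` and `gisaid` objects without changing the logic):

```r
# Hypothetical helper consolidating the repeated plot-generation loop.
# Assumes get_plot() and the gisaid data frame defined earlier in the document.
make_plots <- function(data, season, states, subtypes) {
  plots_list <- list()
  for (state in states) {
    for (subtype in subtypes) {
      tryCatch({
        # Generate and store the plot for the current combination
        plots_list[[paste(season, state, subtype)]] <-
          get_plot(data, season, state, subtype)
      }, error = function(e) {
        # Report combinations with no data instead of aborting the loop
        cat("Error:", e$message, "\n")
      })
    }
  }
  plots_list
}

# e.g. plots_2015 <- make_plots(gisaid, 2015, c("Oregon", "Missouri"),
#                               c("A / H1N1", "A / H3N2", "B"))
```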

clinical_labs |>
        filter(season == 2015) |>
        filter(state == "Oregon") |>
        ggplot(aes(x = ordered_week, y = percent_positive, fill = division)) +
        geom_bar(stat = "identity") +
        geom_hline(yintercept = 10, linetype = "dashed", alpha = 0.6) +
        scale_fill_manual(values = "#526A83") +
        scale_x_discrete(breaks = 1:52, labels = c(40:52, 1:39)) +
        scale_y_continuous(limits = c(0, 60)) +
        labs(x = "Week", y = "Percent Positive", 
             title = "2015 Oregon Influenza Positive Tests") +
        theme_classic() +
        theme(legend.position = "none")

clinical_labs |>
        filter(season == 2015) |>
        filter(state == "Missouri") |>
        ggplot(aes(x = ordered_week, y = percent_positive, fill = division)) +
        geom_bar(stat = "identity") +
        geom_hline(yintercept = 10, linetype = "dashed", alpha = 0.6) +
        scale_fill_manual(values = "#68855C") +
        scale_x_discrete(breaks = 1:52, labels = c(40:52, 1:39)) +
        scale_y_continuous(limits = c(0, 60)) +
        labs(x = "Week", y = "Percent Positive", 
             title = "2015 Missouri Influenza Positive Tests") +
        theme_classic() +
        theme(legend.position = "none")
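The x-axis in these bar charts orders weeks by flu season (MMWR week 40 through 52, then 1 through 39) via `ordered_week`. A minimal sketch of how such an ordering can be built; the actual `clinical_labs` preprocessing happens earlier in the document, so this construction is an assumption:

```r
# Flu seasons start at MMWR week 40, so weeks 40:52 precede weeks 1:39
season_weeks <- c(40:52, 1:39)   # calendar weeks in season order
week <- c(1, 40, 52, 39)         # example calendar week numbers

# match() gives each week's position within the season, which can then
# serve as an ordered factor for the x-axis
ordered_week <- factor(match(week, season_weeks), levels = 1:52)
```

Relabeling the axis with `scale_x_discrete(breaks = 1:52, labels = c(40:52, 1:39))`, as above, then displays the original calendar week numbers in season order.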

flu_season <- c(2016)
max_min_duration <- c("Vermont", "California", "Kentucky")
subtypes <- c("A / H1N1", "A / H3N2", "B")

# Create an empty list to store plots
plots_list <- list()

# Generate plots for all combinations of variables
for (season in flu_season) {
  for (state in max_min_duration) {
    for (subtype in subtypes) {
      tryCatch({
        # Generate a plot for the current combination
        plot <- get_plot(gisaid, season, state, subtype)
        # Store the plot in the list
        plots_list[[paste(season, state, subtype)]] <- plot
      }, error = function(e) {
        # Print a message when there is an error (e.g., no data for the combination)
        cat("Error:", e$message, "\n")
      })
    }
  }
}
Error: 'to' must be a finite number 
plot_grid(plotlist = plots_list[1:2], ncol = 1)

plot_grid(plotlist = plots_list[3:5], ncol = 1)

plot_grid(plotlist = plots_list[6:8], ncol = 1)

clinical_labs |>
        filter(season == 2016) |>
        filter(state == "Vermont") |>
        ggplot(aes(x = ordered_week, y = percent_positive, fill = division)) +
        geom_bar(stat = "identity") +
        geom_hline(yintercept = 10, linetype = "dashed", alpha = 0.6) +
        scale_fill_manual(values = "#526A83") +
        scale_x_discrete(breaks = 1:52, labels = c(40:52, 1:39)) +
        scale_y_continuous(limits = c(0, 60)) +
        labs(x = "Week", y = "Percent Positive", 
             title = "2016 Vermont Influenza Positive Tests") +
        theme_classic() +
        theme(legend.position = "none")

clinical_labs |>
        filter(season == 2016) |>
        filter(state == "California") |>
        ggplot(aes(x = ordered_week, y = percent_positive, fill = division)) +
        geom_bar(stat = "identity") +
        geom_hline(yintercept = 10, linetype = "dashed", alpha = 0.6) +
        scale_fill_manual(values = "#68855C") +
        scale_x_discrete(breaks = 1:52, labels = c(40:52, 1:39)) +
        scale_y_continuous(limits = c(0, 60)) +
        labs(x = "Week", y = "Percent Positive", 
             title = "2016 California Influenza Positive Tests") +
        theme_classic() +
        theme(legend.position = "none")

clinical_labs |>
        filter(season == 2016) |>
        filter(state == "Kentucky") |>
        ggplot(aes(x = ordered_week, y = percent_positive, fill = division)) +
        geom_bar(stat = "identity") +
        geom_hline(yintercept = 10, linetype = "dashed", alpha = 0.6) +
        scale_fill_manual(values = "#68855C") +
        scale_x_discrete(breaks = 1:52, labels = c(40:52, 1:39)) +
        scale_y_continuous(limits = c(0, 60)) +
        labs(x = "Week", y = "Percent Positive", 
             title = "2016 Kentucky Influenza Positive Tests") +
        theme_classic() +
        theme(legend.position = "none")

flu_season <- c(2017)
max_min_duration <- c("Oregon", "South Dakota")
subtypes <- c("A / H1N1", "A / H3N2", "B")

# Create an empty list to store plots
plots_list <- list()

# Generate plots for all combinations of variables
for (season in flu_season) {
  for (state in max_min_duration) {
    for (subtype in subtypes) {
      tryCatch({
        # Generate a plot for the current combination
        plot <- get_plot(gisaid, season, state, subtype)
        # Store the plot in the list
        plots_list[[paste(season, state, subtype)]] <- plot
      }, error = function(e) {
        # Print a message when there is an error (e.g., no data for the combination)
        cat("Error:", e$message, "\n")
      })
    }
  }
}
plot_grid(plotlist = plots_list[1:3], ncol = 1)

plot_grid(plotlist = plots_list[4:6], ncol = 1)

clinical_labs |>
        filter(season == 2017) |>
        filter(state == "Oregon") |>
        ggplot(aes(x = ordered_week, y = percent_positive, fill = division)) +
        geom_bar(stat = "identity") +
        geom_hline(yintercept = 10, linetype = "dashed", alpha = 0.6) +
        scale_fill_manual(values = "#526A83") +
        scale_x_discrete(breaks = 1:52, labels = c(40:52, 1:39)) +
        scale_y_continuous(limits = c(0, 60)) +
        labs(x = "Week", y = "Percent Positive", 
             title = "2017 Oregon Influenza Positive Tests") +
        theme_classic() +
        theme(legend.position = "none")

clinical_labs |>
        filter(season == 2017) |>
        filter(state == "South Dakota") |>
        ggplot(aes(x = ordered_week, y = percent_positive, fill = division)) +
        geom_bar(stat = "identity") +
        geom_hline(yintercept = 10, linetype = "dashed", alpha = 0.6) +
        scale_fill_manual(values = "#68855C") +
        scale_x_discrete(breaks = 1:52, labels = c(40:52, 1:39)) +
        scale_y_continuous(limits = c(0, 60)) +
        labs(x = "Week", y = "Percent Positive", 
             title = "2017 South Dakota Influenza Positive Tests") +
        theme_classic() +
        theme(legend.position = "none")

flu_season <- c(2018)
max_min_duration <- c("Vermont", "South Dakota")
subtypes <- c("A / H1N1", "A / H3N2", "B")

# Create an empty list to store plots
plots_list <- list()

# Generate plots for all combinations of variables
for (season in flu_season) {
  for (state in max_min_duration) {
    for (subtype in subtypes) {
      tryCatch({
        # Generate a plot for the current combination
        plot <- get_plot(gisaid, season, state, subtype)
        # Store the plot in the list
        plots_list[[paste(season, state, subtype)]] <- plot
      }, error = function(e) {
        # Print a message when there is an error (e.g., no data for the combination)
        cat("Error:", e$message, "\n")
      })
    }
  }
}
plot_grid(plotlist = plots_list[1:3], ncol = 1)

plot_grid(plotlist = plots_list[4:6], ncol = 1)

#ggsave("2018_min.tiff", device = tiff, width = 15, height = 7.5, dpi = 700)

clinical_labs |>
        filter(season == 2018) |>
        filter(state == "Vermont") |>
        ggplot(aes(x = ordered_week, y = percent_positive, fill = division)) +
        geom_bar(stat = "identity") +
        geom_hline(yintercept = 10, linetype = "dashed", alpha = 0.6) +
        scale_fill_manual(values = "#526A83") +
        scale_x_discrete(breaks = 1:52, labels = c(40:52, 1:39)) +
        scale_y_continuous(limits = c(0, 60)) +
        labs(x = "Week", y = "Percent Positive", 
             title = "2018 Vermont Influenza Positive Tests") +
        theme_classic() +
        theme(legend.position = "none")

clinical_labs |>
        filter(season == 2018) |>
        filter(state == "South Dakota") |>
        ggplot(aes(x = ordered_week, y = percent_positive, fill = division)) +
        geom_bar(stat = "identity") +
        geom_hline(yintercept = 10, linetype = "dashed", alpha = 0.6) +
        scale_fill_manual(values = "#68855C") +
        scale_x_discrete(breaks = 1:52, labels = c(40:52, 1:39)) +
        scale_y_continuous(limits = c(0, 60)) +
        labs(x = "Week", y = "Percent Positive", 
             title = "2018 South Dakota Influenza Positive Tests") +
        theme_classic() +
        theme(legend.position = "none")

#ggsave("2018_southdakota.tiff", device = tiff, dpi = 700, width = 10, height = 5)

Conclusion


References

Centers for Disease Control and Prevention. U.S. Influenza Surveillance: Purpose and Methods. https://www.cdc.gov/flu/weekly/overview.htm